Cities Institute R Workshop Day 2 - Census

Stephen Rouse

Census Data & the CanCensus Package

Census data is central in situating analysis, whether we want to make use of census data directly, mix it with our own data or just use it to calibrate external data we have.

In this workshop we’ll explore how to work with census data and use it in conjunction with our own data.

Census Data

Census data offers rich variables at high spatial resolution, but at coarse time intervals.

Richer data comes at a price: data discovery and acquisition are more complex. Enter CensusMapper.

CensusMapper is a flexible census data mapping platform. Anyone can explore and map census data.

CensusMapper is also an API server that facilitates data acquisition for analysis, and it doubles as a GUI data selection tool.

CanCensus Package

The cancensus R package interfaces with the CensusMapper API server. It can be queried for:

  • census geographies

  • census data

  • hierarchical metadata of census variables

  • some non-census data that comes on census geographies, e.g. T1FF taxfiler data

Setting up API Key

One slight complication: the cancensus package needs an API key. You can sign up for one on CensusMapper and install it with the set_cancensus_api_key function, using the install = TRUE option so that it’s always available and you won’t expose your API key when sharing code.

#new_packages <- c("cancensus","tongfen")
#install.packages(new_packages)
library(tidyverse)
library(cansim)
library(cancensus)
library(tongfen)

#set_cancensus_api_key("Census_mapper_1240815019385", install = TRUE) 

This line is commented out because you only need to run it once. When you get your API key, paste it in place of the example key above and run the line without the # sign.

This will install the API key as a system variable in your .Renviron so that it’s available in every R session and you won’t expose your API key when sharing code.

Setting up the Cache

This step will store any census data you grab locally, so that you don’t have to keep re-downloading it every time. It’s really helpful for speeding up code when you’re working with bigger tables, especially ones that you use often.

#set_cancensus_cache_path("~/Econ/Cancensus Cache",install=TRUE,overwrite = TRUE)
show_cancensus_cache_path()
[1] "~/Econ/Cancensus Cache"

You’ll see a use_cache argument in most of the functions we use next. It determines whether the function reads from your local cache. It’s set to TRUE by default, so data already stored on your computer is reused rather than downloaded again.

To force cancensus to refresh the data and re-download it through the CensusMapper API, you can specify use_cache = FALSE in the functions we’ll learn about next.
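To confirm that both settings were picked up after restarting R, one option is to look for the environment variables that the install = TRUE options write to .Renviron. This is a minimal sketch; it assumes the variables are named CM_API_KEY and CM_CACHE_PATH, so treat those names as an assumption if your cancensus version differs.

```r
# Assumed environment variable names written by the install = TRUE options
settings <- c(api_key = "CM_API_KEY", cache_path = "CM_CACHE_PATH")

# TRUE if the variable exists and is non-empty in this R session
is_set <- vapply(settings, function(v) nzchar(Sys.getenv(v)), logical(1))

for (n in names(is_set)) {
  message(n, ": ", if (is_set[[n]]) "set" else "not set (restart R or re-run the setup line)")
}
```

The sketch only checks that something is stored; it never prints the key itself, so it is safe to run while sharing your screen.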

Testing it out

  • cancensus retrieves Census data with get_census, and the geo_format argument controls what comes back:

    • geo_format = NA returns Census data only, as a flat data frame
    census_data <- get_census(dataset='CA21', regions=list(CMA="48835"), vectors=c("v_CA21_434"), level='CSD', use_cache = FALSE, geo_format = NA, quiet = TRUE)
    
    summary(census_data)
        GeoUID           Type                 Region Name  Area (sq km)      
     Length:34          CSD:34   Alexander 134 (IRI): 1   Min.   :   0.1922  
     Class :character            Beaumont (CY)      : 1   1st Qu.:   2.1634  
     Mode  :character            Betula Beach (SV)  : 1   Median :  10.3084  
                                 Bon Accord (T)     : 1   Mean   : 276.9466  
                                 Bruderheim (T)     : 1   3rd Qu.:  50.6092  
                                 Calmar (T)         : 1   Max.   :2502.5902  
                                 (Other)            :28                      
       Population        Dwellings          Households          CD_UID         
     Min.   :     18   Min.   :     5.0   Min.   :     5.0   Length:34         
     1st Qu.:    355   1st Qu.:   287.2   1st Qu.:   147.8   Class :character  
     Median :   1643   Median :   600.0   Median :   566.5   Mode  :character  
     Mean   :  41709   Mean   : 17339.8   Mean   : 16136.0                     
     3rd Qu.:  19544   3rd Qu.:  7398.2   3rd Qu.:  7003.2                     
     Max.   :1010899   Max.   :428857.0   Max.   :396404.0                     
    
        PR_UID            CMA_UID         
     Length:34          Length:34         
     Class :character   Class :character  
     Mode  :character   Mode  :character  
    
    
    
    
     v_CA21_434: Occupied private dwellings by structural type of dwelling data
     Min.   :    25.0                                                          
     1st Qu.:   376.2                                                          
     Median :  1055.0                                                          
     Mean   : 19590.7                                                          
     3rd Qu.:  7956.2                                                          
     Max.   :396400.0                                                          
     NA's   :6                                                                 
    • geo_format = 'sf' returns the Census data together with its geography, as an sf spatial data frame
    census_data <- get_census(dataset='CA21', regions=list(CMA="48835"),
                              vectors=c("v_CA21_434"),
                              level='CSD', use_cache = FALSE, geo_format = 'sf', quiet = TRUE)
    summary(census_data)
       CMA_UID             CD_UID              name           Dwellings 2016  
     Length:34          Length:34          Length:34          Min.   :     5  
     Class :character   Class :character   Class :character   1st Qu.:   308  
     Mode  :character   Mode  :character   Mode  :character   Median :   604  
                                                              Mean   : 15813  
                                                              3rd Qu.:  6719  
                                                              Max.   :388254  
    
       Population        Dwellings           PR_UID          Population 2016   
     Min.   :     18   Min.   :     5.0   Length:34          Min.   :    10.0  
     1st Qu.:    355   1st Qu.:   287.2   Class :character   1st Qu.:   301.5  
     Median :   1643   Median :   600.0   Mode  :character   Median :  1641.0  
     Mean   :  41709   Mean   : 17339.8                      Mean   : 38865.9  
     3rd Qu.:  19544   3rd Qu.:  7398.2                      3rd Qu.: 17390.0  
     Max.   :1010899   Max.   :428857.0                      Max.   :933088.0  
    
       Households        Type       GeoUID          Households 2016   
     Min.   :     5.0   CSD:34   Length:34          Min.   :     5.0  
     1st Qu.:   147.8            Class :character   1st Qu.:   126.8  
     Median :   566.5            Mode  :character   Median :   534.0  
     Mean   : 16136.0                               Mean   : 14769.1  
     3rd Qu.:  7003.2                               3rd Qu.:  6394.2  
     Max.   :396404.0                               Max.   :361033.0  
    
     Quality Flags        Shape Area                     Region Name
     Length:34          Min.   :   0.1922   Alexander 134 (IRI): 1  
     Class :character   1st Qu.:   2.1634   Beaumont (CY)      : 1  
     Mode  :character   Median :  10.3084   Betula Beach (SV)  : 1  
                        Mean   : 276.9466   Bon Accord (T)     : 1  
                        3rd Qu.:  50.6092   Bruderheim (T)     : 1  
                        Max.   :2502.5902   Calmar (T)         : 1  
                                            (Other)            :28  
      Area (sq km)      
     Min.   :   0.1922  
     1st Qu.:   2.1634  
     Median :  10.3084  
     Mean   : 276.9466  
     3rd Qu.:  50.6092  
     Max.   :2502.5902  
    
     v_CA21_434: Occupied private dwellings by structural type of dwelling data
     Min.   :    25.0                                                          
     1st Qu.:   376.2                                                          
     Median :  1055.0                                                          
     Mean   : 19590.7                                                          
     3rd Qu.:  7956.2                                                          
     Max.   :396400.0                                                          
     NA's   :6                                                                 
              geometry 
     MULTIPOLYGON :34  
     epsg:4326    : 0  
     +proj=long...: 0  
    
    
    
    

Testing it out (continued)

With geo_format = 'sp', get_census returns the Census data and geography together as an sp spatial object instead

census_data <- get_census(dataset='CA21', regions=list(CMA="48835"),
                          vectors=c("v_CA21_434"),
                          level='CSD', use_cache = FALSE, geo_format = 'sp', quiet = TRUE)

head(census_data) %>% knitr::kable()
| CMA_UID | CD_UID | name | Dwellings.2016 | Population | Dwellings | PR_UID | Population.2016 | Households | Type | GeoUID | Households.2016 | Quality.Flags | Shape.Area | Region.Name | Area..sq.km. | v_CA21_434..Occupied.private.dwellings.by.structural.type.of.dwelling.data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 48835 | 4810 | Bruderheim (T) | 629 | 1329 | 552 | 48 | 1323 | 515 | CSD | 4810066 | 507 | 0 | 9.2781 | Bruderheim (T) | 9.2781 | 515 |
| 48835 | 4811 | Leduc County (MD) | 5621 | 14416 | 5990 | 48 | 13177 | 5295 | CSD | 4811012 | 4875 | 0 | 2502.5902 | Leduc County (MD) | 2502.5902 | 5295 |
| 48835 | 4811 | Beaumont (CY) | 6015 | 20888 | 7168 | 48 | 17457 | 6950 | CSD | 4811013 | 5654 | 0 | 24.7019 | Beaumont (CY) | 24.7019 | 6950 |
| 48835 | 4811 | Leduc (CY) | 12264 | 34094 | 13507 | 48 | 29993 | 12964 | CSD | 4811016 | 11319 | 0 | 42.2532 | Leduc (CY) | 42.2532 | 12960 |
| 48835 | 4811 | Devon (T) | 2493 | 6545 | 2588 | 48 | 6578 | 2496 | CSD | 4811018 | 2415 | 0 | 14.2554 | Devon (T) | 14.2554 | 2495 |
| 48835 | 4811 | Calmar (T) | 861 | 2183 | 937 | 48 | 2228 | 893 | CSD | 4811019 | 842 | 0 | 4.6686 | Calmar (T) | 4.6686 | 895 |

Census Datasets

Cancensus can access Statistics Canada Census data for Census years 1996, 2001, 2006, 2011, 2016, and 2021. You can run list_census_datasets to check what datasets are currently available for access through the CensusMapper API.

list_census_datasets() 
# A tibble: 29 × 6
   dataset description           geo_dataset attribution reference reference_url
   <chr>   <chr>                 <chr>       <chr>       <chr>     <chr>        
 1 CA1996  1996 Canada Census    CA1996      StatCan 19… 92-351-U  https://www1…
 2 CA01    2001 Canada Census    CA01        StatCan 20… 92-378-X  https://www1…
 3 CA06    2006 Canada Census    CA06        StatCan 20… 92-566-X  https://www1…
 4 CA11    2011 Canada Census a… CA11        StatCan 20… 98-301-X… https://www1…
 5 CA16    2016 Canada Census    CA16        StatCan 20… 98-301-X  https://www1…
 6 CA21    2021 Canada Census    CA21        StatCan 20… 98-301-X  https://www1…
 7 CA01xSD 2001 Canada Census x… CA01        StatCan 20… 92-378-X  https://www1…
 8 CA06xSD 2006 Canada Census x… CA06        StatCan 20… 92-566-X  https://www1…
 9 CA11xSD 2011 Canada Census x… CA11        StatCan 20… 98-301-X  https://www1…
10 CA16xSD 2016 Canada Census x… CA16        StatCan 20… 98-301-X  https://www1…
# ℹ 19 more rows

Census Regions

Census data is aggregated at multiple geographic levels. Census geographies at the national (C), provincial (PR), census metropolitan area (CMA), census agglomeration (CA), census division (CD), and census subdivision (CSD) levels are defined as named census regions.

Canadian Census geography can change in between Census periods. Cancensus provides a function, list_census_regions(dataset), to display all named census regions and their corresponding id for a given census dataset.

list_census_regions("CA21")
# A tibble: 5,518 × 8
   region name               level    pop municipal_status CMA_UID CD_UID PR_UID
   <chr>  <chr>              <chr>  <int> <chr>            <chr>   <chr>  <chr> 
 1 01     Canada             C     3.70e7 <NA>             <NA>    <NA>   <NA>  
 2 35     Ontario            PR    1.42e7 Ont.             <NA>    <NA>   <NA>  
 3 24     Quebec             PR    8.50e6 Que.             <NA>    <NA>   <NA>  
 4 59     British Columbia   PR    5.00e6 B.C.             <NA>    <NA>   <NA>  
 5 48     Alberta            PR    4.26e6 Alta.            <NA>    <NA>   <NA>  
 6 46     Manitoba           PR    1.34e6 Man.             <NA>    <NA>   <NA>  
 7 47     Saskatchewan       PR    1.13e6 Sask.            <NA>    <NA>   <NA>  
 8 12     Nova Scotia        PR    9.69e5 N.S.             <NA>    <NA>   <NA>  
 9 13     New Brunswick      PR    7.76e5 N.B.             <NA>    <NA>   <NA>  
10 10     Newfoundland and … PR    5.11e5 N.L.             <NA>    <NA>   <NA>  
# ℹ 5,508 more rows
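The region codes are hierarchical. In the outputs in this workshop, a census subdivision code such as 4810066 starts with its census division code (4810), which in turn starts with its province code (48). As a small sketch, and assuming that pattern holds for the CSD codes you work with, the parent codes can be read off with substr():

```r
# CSD codes taken from the outputs shown in this workshop
csd_uid <- c("4810066", "4811012", "3520005")

parents <- data.frame(
  CSD_UID = csd_uid,
  CD_UID  = substr(csd_uid, 1, 4),  # first four digits: census division
  PR_UID  = substr(csd_uid, 1, 2),  # first two digits: province
  stringsAsFactors = FALSE
)
print(parents)
```

Keeping the codes as character strings matters here, since the prefixes are positions in a string rather than numbers.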

Revisiting the code for Edmonton

library(cancensus)
list_census_regions('CA21') %>% filter(level=="CMA", name=="Edmonton")
# A tibble: 1 × 8
  region name     level     pop municipal_status CMA_UID CD_UID PR_UID
  <chr>  <chr>    <chr>   <int> <chr>            <chr>   <chr>  <chr> 
1 48835  Edmonton CMA   1418118 B                <NA>    <NA>   48    
census_data <- get_census(dataset='CA21', regions=list(CMA="48835"),
                          vectors="v_CA21_434",
                          level='CSD', use_cache = FALSE, quiet = TRUE)

This is how we found the region code used in the calls above. To grab the same data for Edmonton at a finer geographic level, we only need to change the level argument, here from census subdivisions (CSD) to census tracts (CT).

census_data <- get_census(dataset='CA21', regions=list(CMA="48835"),
                          vectors="v_CA21_434",
                          level='CT', use_cache = FALSE, quiet = TRUE)

Working with Census Variables

Census data contains thousands of different geographic regions as well as thousands of unique variables. In addition to enabling programmatic and reproducible access to Census data, cancensus has a number of tools to help users find the data they are looking for.

You can run the following code to view all available Census variables for a given dataset:

list_census_vectors("CA21")
# A tibble: 7,709 × 7
   vector    type   label                units parent_vector aggregation details
   <chr>     <fct>  <chr>                <fct> <chr>         <chr>       <chr>  
 1 v_CA21_1  Total  Population, 2021     Numb… <NA>          Additive    CA 202…
 2 v_CA21_2  Total  Population, 2016     Numb… <NA>          Additive    CA 202…
 3 v_CA21_3  Total  Population percenta… Numb… <NA>          Average of… CA 202…
 4 v_CA21_4  Total  Total private dwell… Numb… <NA>          Additive    CA 202…
 5 v_CA21_5  Total  Private dwellings o… Numb… v_CA21_4      Additive    CA 202…
 6 v_CA21_6  Total  Population density … Ratio <NA>          Average of… CA 202…
 7 v_CA21_7  Total  Land area in square… Numb… <NA>          Additive    CA 202…
 8 v_CA21_8  Total  Total - Age          Numb… <NA>          Additive    CA 202…
 9 v_CA21_9  Male   Total - Age          Numb… <NA>          Additive    CA 202…
10 v_CA21_10 Female Total - Age          Numb… <NA>          Additive    CA 202…
# ℹ 7,699 more rows

Working with Census Variables (continued)

For each variable (vector) in that Census dataset, this shows:

  • Vector: short variable code

  • Type: variables are provided as aggregates of female responses, male responses, or total (male+female) responses

  • Label: detailed variable name

  • Units: provides information about whether the variable represents a count integer, a ratio, a percentage, or a currency figure

  • Parent_vector: shows the immediate hierarchical parent category for that variable, where appropriate

  • Aggregation: indicates how the variable should be aggregated with others, whether it is additive or if it is an average of another variable

  • Details: a rough description of a variable based on its hierarchical structure. cancensus constructs it by recursively traversing the labels up each variable’s hierarchy, which makes it easier to search for specific variables using key terms
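To make the recursive traversal concrete, here is a toy sketch of how a details-style string can be assembled by walking up the parent_vector chain. It is an illustration of the idea, not the cancensus implementation: build_details is a hypothetical helper, the labels are shortened, and the parent_vector values are inferred from the low income details strings rather than queried.

```r
# Toy vector list; parent links are assumed for illustration
vectors <- data.frame(
  vector        = c("v_CA21_1025", "v_CA21_1028", "v_CA21_1031"),
  label         = c("In low income (LIM-AT)", "0 to 17 years", "0 to 5 years"),
  parent_vector = c(NA, "v_CA21_1025", "v_CA21_1028"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prepend the parent's path until a top-level vector is reached
build_details <- function(v, vectors) {
  row <- vectors[vectors$vector == v, ]
  if (is.na(row$parent_vector)) {
    return(row$label)
  }
  paste(build_details(row$parent_vector, vectors), row$label, sep = "; ")
}

details <- build_details("v_CA21_1031", vectors)
print(details)
```

The result reads from the broadest category down to the specific variable, which is why searching the details column for a key term also finds variables whose own label is as terse as “0 to 5 years”.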

Finding Variables to Work With

As you can tell, it’s pretty hard to find a variable just by browsing that list. cancensus provides the find_census_vectors() function to help with that.

find_census_vectors("Australia",dataset="CA21",type="total",query_type="exact") 
# A tibble: 3 × 4
  vector      type  label      details                                          
  <chr>       <fct> <chr>      <chr>                                            
1 v_CA21_4812 Total Australia  25% Data; Citizenship and immigration; Total - P…
2 v_CA21_5223 Total Australian 25% Data; Visible minority and ethnic origin; To…
3 v_CA21_6483 Total Australia  25% Data; Education; Total - Location of study c…
find_census_vectors("Australia origin",dataset="CA21",type="total",query_type="semantic") 
# A tibble: 251 × 4
   vector      type  label                                               details
   <chr>       <fct> <chr>                                               <chr>  
 1 v_CA21_4917 Total Total - Ethnic or cultural origin for the populati… 25% Da…
 2 v_CA21_4920 Total Canadian                                            25% Da…
 3 v_CA21_4923 Total English                                             25% Da…
 4 v_CA21_4926 Total Irish                                               25% Da…
 5 v_CA21_4929 Total Scottish                                            25% Da…
 6 v_CA21_4932 Total French, n.o.s.                                      25% Da…
 7 v_CA21_4935 Total German                                              25% Da…
 8 v_CA21_4938 Total Chinese                                             25% Da…
 9 v_CA21_4941 Total Italian                                             25% Da…
10 v_CA21_4944 Total Indian (India)                                      25% Da…
# ℹ 241 more rows

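The “exact” search is very precise, but you can miss key variables if you don’t know exactly what you’re looking for. The “semantic” search is more forgiving and returns variables that loosely match the query phrase, at the cost of many more results to sift through (251 here).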

Finding Variables to Work With (continued)

One other search option is “keyword”, which splits the query into individual words and returns the variables that match the largest number of them. It also has an interactive option that you can play around with:

find_census_vectors("Australian ethnic", dataset = "CA21", type = "total", query_type = "keyword", interactive = FALSE)
# A tibble: 1 × 4
  vector      type  label      details                                          
  <chr>       <fct> <chr>      <chr>                                            
1 v_CA21_5223 Total Australian 25% Data; Visible minority and ethnic origin; To…
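The following toy sketch shows why the keyword query returns a single vector. It is not how cancensus implements the search; it simply counts how many query words appear in each candidate and keeps the best scorers. The text column is an abbreviated stand-in for the label and details of the three “Australia” vectors from the exact search.

```r
# Abbreviated label + details text for the three vectors found by the exact search
candidates <- data.frame(
  vector = c("v_CA21_4812", "v_CA21_5223", "v_CA21_6483"),
  text   = c("Australia; Citizenship and immigration",
             "Australian; Visible minority and ethnic origin",
             "Australia; Education"),
  stringsAsFactors = FALSE
)

query <- "Australian ethnic"
words <- strsplit(tolower(query), "\\s+")[[1]]

# Score each candidate by the number of query words found in its text
scores <- vapply(
  tolower(candidates$text),
  function(txt) sum(vapply(words, grepl, logical(1), x = txt, fixed = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)

# Keep the candidates with the highest non-zero score
best <- candidates$vector[scores > 0 & scores == max(scores)]
print(best)
```

Only the ethnic origin vector contains both words, so it is the only one kept.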

Looking at Poverty in Edmonton

list_census_regions('CA21') %>% filter(level=="CMA", name=="Edmonton")
# A tibble: 1 × 8
  region name     level     pop municipal_status CMA_UID CD_UID PR_UID
  <chr>  <chr>    <chr>   <int> <chr>            <chr>   <chr>  <chr> 
1 48835  Edmonton CMA   1418118 B                <NA>    <NA>   48    
find_census_vectors("Low Income Measures",dataset="CA21",type="total",query_type="semantic")  %>% knitr::kable()
| vector | type | label | details |
|---|---|---|---|
| v_CA21_1025 | Total | In low income based on the Low-income measure, after tax (LIM-AT) | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT) |
| v_CA21_1028 | Total | 0 to 17 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years |
| v_CA21_1031 | Total | 0 to 5 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years; 0 to 5 years |
| v_CA21_1034 | Total | 18 to 64 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 18 to 64 years |
| v_CA21_1037 | Total | 65 years and over | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 65 years and over |
| v_CA21_1040 | Total | Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) |
| v_CA21_1043 | Total | 0 to 17 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years |
| v_CA21_1046 | Total | 0 to 5 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years; 0 to 5 years |
| v_CA21_1049 | Total | 18 to 64 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 18 to 64 years |
| v_CA21_1052 | Total | 65 years and over | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 65 years and over |

Mapping the Data

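# v_CA21_1028 is the LIM-AT count for ages 0 to 17; lico_at is just the column name used in the plot
pv <- c(lico_at="v_CA21_1028")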

poverty_data <- get_census("CA21", regions=list(CMA="48835"), vectors=pv, geo_format="sf", level="CT")

ggplot(poverty_data,aes(fill=lico_at)) +
  geom_sf(size=NA) +  
  labs(title="Number of children in poverty - Edmonton",fill=NULL,caption="StatCan Census 2021")

Looking at Poverty (%)

find_census_vectors("Low Income Measures %",dataset="CA21",type="total",query_type="semantic")  %>% knitr::kable()
| vector | type | label | details |
|---|---|---|---|
| v_CA21_1025 | Total | In low income based on the Low-income measure, after tax (LIM-AT) | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT) |
| v_CA21_1028 | Total | 0 to 17 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years |
| v_CA21_1031 | Total | 0 to 5 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 0 to 17 years; 0 to 5 years |
| v_CA21_1034 | Total | 18 to 64 years | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 18 to 64 years |
| v_CA21_1037 | Total | 65 years and over | Income; Low income status; LIM-AT; In low income based on the Low-income measure, after tax (LIM-AT); 65 years and over |
| v_CA21_1040 | Total | Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%) |
| v_CA21_1043 | Total | 0 to 17 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years |
| v_CA21_1046 | Total | 0 to 5 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 0 to 17 years; 0 to 5 years |
| v_CA21_1049 | Total | 18 to 64 years | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 18 to 64 years |
| v_CA21_1052 | Total | 65 years and over | Income; Low income status; LIM-AT; Prevalence of low income based on the Low-income measure, after tax (LIM-AT) (%); 65 years and over |

Mapping the Data

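# v_CA21_1043 is the LIM-AT prevalence (%) for ages 0 to 17; lico_at is just the column name used in the plot
pv <- c(lico_at="v_CA21_1043")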

poverty_data <- get_census("CA21", regions=list(CMA="48835"), vectors=pv, geo_format="sf", level="CT")

ggplot(poverty_data,aes(fill=lico_at/100)) +
  geom_sf(size=NA) +  
  labs(title="Share of children in poverty - Edmonton",fill=NULL,caption="StatCan Census 2021")

Adapting for another Region

list_census_regions('CA21') %>% filter(level=="CMA", name=="Vancouver")
# A tibble: 1 × 8
  region name      level     pop municipal_status CMA_UID CD_UID PR_UID
  <chr>  <chr>     <chr>   <int> <chr>            <chr>   <chr>  <chr> 
1 59933  Vancouver CMA   2642825 B                <NA>    <NA>   59    
poverty_data <- get_census("CA21", regions=list(CMA="59933"), vectors=pv, geo_format="sf", level="CT") 

ggplot(poverty_data,aes(fill=lico_at/100)) +
  geom_sf(size=NA) +  
  labs(title="% of children in poverty - Vancouver",fill=NULL,caption="StatCan Census 2021")

Re-mapping with a Different Theme

If we want a different colour scale on the map, there are a lot of options to choose from. Here we use the viridis “inferno” palette and format the legend as percentages:

poverty_data <- get_census("CA21", regions=list(CMA="59933"), vectors=pv, geo_format="sf", level="CT") 
ggplot(poverty_data,aes(fill=lico_at/100)) +
  geom_sf(size=NA) +  
  labs(title="% of children in poverty - Vancouver",fill=NULL,caption="StatCan Census 2021") +
  scale_fill_viridis_c(option = "inferno",labels=scales::percent)

Mixing data sources

Mixing data sources is hard, especially when dealing with spatial data.

If spatial units match across datasets, it is easy to compare the other data we have. If spatial units don’t match, things get complicated. And annoying.

A simple example is the Labour Force Survey (LFS) time series for CMAs. We don’t have one long time series but several partially overlapping shorter ones. The reason is that CMA geography changes over time, so we can’t directly compare data when the geography (and with it the denominators) changes.

Mixing Data when Spatial Units Match

The cansim package returns a geographic identifier GeoUID that matches census identifiers returned by cancensus. That makes matching data from those two data sources relatively easy.

We’ll try out an example here with an Income Distribution measure from StatCan:

income_distribution <- get_cansim("11-10-0074") %>% select(GeoUID,`D-index`=VALUE)

list_census_regions('CA16') %>% filter(level=="CMA", name=="Toronto")
# A tibble: 1 × 8
  region name    level     pop municipal_status CMA_UID CD_UID PR_UID
  <chr>  <chr>   <chr>   <int> <chr>            <chr>   <chr>  <chr> 
1 35535  Toronto CMA   5928040 B                <NA>    <NA>   35    
toronto <- get_census("CA16",regions=list(CMA="35535"),geo_format = 'sf',level="CT")
  
merged_data <- left_join(toronto,income_distribution, by="GeoUID") 


merged_data %>%
  ggplot(aes(fill=`D-index`)) +
  geom_sf(size=0.1) + scale_fill_viridis_c() +
  coord_sf(datum=NA,xlim=c(-79.8,-79.15),ylim=c(43.6,43.8)) +
  labs(title="Income divergence index", caption="StatCan table 11-10-0074")
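Before mapping a join like the one above, it helps to check how many census units actually found a match, since unmatched units silently become NA and show up as grey holes on the map. The sketch below uses made-up GeoUID values and numbers, and uses base R merge() with all.x = TRUE as a stand-in for left_join(). Note that the identifiers are kept as character strings: a tract ID such as "5350001.00" would lose its trailing zeros if it were read as a number.

```r
# Hypothetical census table and external table keyed by GeoUID (values are made up)
census <- data.frame(
  GeoUID     = c("5350001.00", "5350002.00", "5350003.00"),
  Population = c(600, 550, 700),
  stringsAsFactors = FALSE
)
external <- data.frame(
  GeoUID  = c("5350001.00", "5350003.00"),
  d_index = c(0.12, 0.30),
  stringsAsFactors = FALSE
)

# Keep every census row, filling NA where the external table has no match
merged <- merge(census, external, by = "GeoUID", all.x = TRUE)

# Census units without external data
unmatched <- merged$GeoUID[is.na(merged$d_index)]
print(unmatched)
```

If the list of unmatched units is long, the usual suspects are mismatched census years or identifiers stored in different formats.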

Mixing Census Data Across Years

Census geographies often change over time, which complicates comparisons using more than one year of data.

The best way to deal with this is a custom data request, but that takes time, costs money and is overkill for many applications. An immediate way to achieve almost the same result is using the tongfen package.

Tongfen works when the spatial units change but remain congruent, that is, derived from one another by a (generally short) series of split and join operations. It aggregates the units from each year up to a common geography on which the data can be compared directly.
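The idea can be illustrated with a toy example in base R. This is a sketch of the aggregation logic only, not of tongfen’s implementation: the tract names, the counts and the to_common helper are all hypothetical, and the common unit for each tract is written out by hand rather than computed.

```r
# Hypothetical correspondence between 2001 and 2021 tracts:
# tract A was split into A1 and A2, while tracts B and C were merged into BC
links <- data.frame(
  ct_2001 = c("A",  "A",  "B",  "C"),
  ct_2021 = c("A1", "A2", "BC", "BC"),
  common  = c("g1", "g1", "g2", "g2"),  # common unit covering each group
  stringsAsFactors = FALSE
)

counts_2001 <- data.frame(ct_2001 = c("A", "B", "C"), children = c(100, 40, 60))
counts_2021 <- data.frame(ct_2021 = c("A1", "A2", "BC"), children = c(70, 50, 90))

# Sum one year's counts up to the common units
to_common <- function(counts, key) {
  lookup <- unique(links[, c(key, "common")])
  merged <- merge(counts, lookup, by = key)
  tapply(merged$children, merged$common, sum)
}

children_2001 <- to_common(counts_2001, "ct_2001")
children_2021 <- to_common(counts_2021, "ct_2021")

# Both years now live on the same units, so the difference is meaningful
change <- children_2021[names(children_2001)] - children_2001
print(change)
```

The price of comparability is resolution: the common units are never smaller than the tracts of either year.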

Tongfen & CensusMapper

With this example, we’ll take a look at how the number of children in Toronto has changed over time. We’ll use the CensusMapper API GUI to select the Census vectors that we need. We want children under 15 for 2001 and 2021: the 2021 data has a single vector for this, while for 2001 we assemble it from 5-year age groups for males and females.

search_census_regions("Toronto","CA21")
# A tibble: 3 × 8
  region  name    level     pop municipal_status CMA_UID CD_UID PR_UID
  <chr>   <chr>   <chr>   <int> <chr>            <chr>   <chr>  <chr> 
1 35535   Toronto CMA   6202225 B                <NA>    <NA>   35    
2 3520    Toronto CD    2794356 CDR              <NA>    <NA>   35    
3 3520005 Toronto CSD   2794356 C                35535   3520   35    
census_data <- get_census(dataset='CA21', regions=list(CSD="3520005"), vectors=c("v_CA21_11"), labels="detailed", geo_format=NA, level='CSD')

# Census vectors we want to compare over time: one vector for 2021, and
# three 5-year age groups each for males (m) and females (f) for 2001
variables <- c(children_2021="v_CA21_11",
               children1m_2001="v_CA01_7",
               children2m_2001="v_CA01_8",
               children3m_2001="v_CA01_9",
               children1f_2001="v_CA01_26",
               children2f_2001="v_CA01_27",
               children3f_2001="v_CA01_28")

# "meta" for Tongfen describes the Census vectors that we need to compare over time
meta <- meta_for_ca_census_vectors(variables)
toronto_children <- get_tongfen_ca_census(regions=list(CSD="3520005"), meta=meta, level = "CT", na.rm = TRUE)

Preparing the Data

Since the 2001 Census data is split into 5-year age groups by sex, rather than given as a single under-15 total, we’ll need to add the groups up to compare them with the 2021 Census data.

plot_data <- toronto_children %>%
  mutate(children_2001=children1m_2001+children2m_2001+children3m_2001+
           children1f_2001+children2f_2001+children3f_2001) %>%
  select(matches("children_\\d{4}|Population")) 

This fancy code at the end gets rid of all the extra columns we had, since the original table was pretty clunky. The first part of the pattern in matches() keeps only columns whose names contain “children_” followed by 4 digits (e.g. children_2021), which drops the individual age-group columns such as children1m_2001. The second part keeps the columns that have “Population” in their name. Because the data is an sf object, the geometry column is kept automatically.
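To see which columns the pattern keeps without downloading anything, the same regular expression can be tried on a plain vector of column names with grepl(). The names below are an illustrative subset of what the tongfen table might contain, not its exact column list.

```r
# Illustrative column names; the real table has more columns
cols <- c("children_2021", "children1m_2001", "children3f_2001",
          "children_2001", "Population_CA21", "Population_CA01", "TongfenID")

# Same pattern as in the matches() call
keep <- grepl("children_\\d{4}|Population", cols)

kept_cols <- cols[keep]
print(kept_cols)
```

The age-group columns are dropped because in names like children1m_2001 the text “children” is not immediately followed by an underscore.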

Analysis & Visualization

ggplot(plot_data, aes(fill=children_2021-children_2001)) +
  geom_sf() +
  scale_fill_gradient2(labels=scales::comma) +
  coord_sf(datum=NA) +
  labs(title="City of Toronto change in number of children under 15 between 2001 and 2021",
       fill="Number of\nchildren",
       caption="StatCan Census 2001, 2021")

Mapping it again Differently

ggplot(plot_data, aes(fill=children_2021/Population_CA21-children_2001/Population_CA01)) +
  geom_sf() +
  scale_fill_gradient2(labels=scales::percent) +
  coord_sf(datum=NA) +
  labs(title="City of Toronto change in share of children under 15 between 2001 and 2021",
       fill="Percentage\npoint\nchange",
       caption="StatCan Census 2001, 2021")
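The two maps answer different questions, and a small worked example shows how they can diverge. The numbers below are made up for a single hypothetical tract; the point is only to separate the change in the count, the percentage point change in the share (what the second map shows), and the relative change in the share.

```r
# Hypothetical tract: all counts are made up for illustration
children_2001 <- 200; population_2001 <- 1000
children_2021 <- 180; population_2021 <- 1200

share_2001 <- children_2001 / population_2001   # 0.20
share_2021 <- children_2021 / population_2021   # 0.15

count_change <- children_2021 - children_2001   # change in number of children
pp_change    <- share_2021 - share_2001         # percentage point change in the share
rel_change   <- share_2021 / share_2001 - 1     # relative change in the share

print(c(count = count_change, pp = pp_change, relative = rel_change))
```

Here the tract loses 20 children, the share of children drops by 5 percentage points, and the share itself shrinks by a quarter, because the overall population grew while the number of children fell.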

TongFen with T1FF data - Advanced Example

T1FF taxfiler data is a rich source of demographic information, now available for census tracts. Variable naming is consistent across years, making it possible to programmatically assemble data.

library(stringr)
list_census_datasets() %>% filter(str_detect(description,"T1FF"))
# A tibble: 19 × 6
   dataset description           geo_dataset attribution reference reference_url
   <chr>   <chr>                 <chr>       <chr>       <chr>     <chr>        
 1 TX2000  2000 T1FF taxfiler d… CA1996      StatCan 20… 72-212-X  https://www1…
 2 TX2001  2001 T1FF taxfiler d… CA01        StatCan 20… 72-212-X  https://www1…
 3 TX2002  2002 T1FF taxfiler d… CA01        StatCan 20… 72-212-X  https://www1…
 4 TX2003  2003 T1FF taxfiler d… CA01        StatCan 20… 72-212-X  https://www1…
 5 TX2004  2004 T1FF taxfiler d… CA01        StatCan 20… 72-212-X  https://www1…
 6 TX2005  2005 T1FF taxfiler d… CA01        StatCan 20… 72-212-X  https://www1…
 7 TX2006  2006 T1FF taxfiler d… CA06        StatCan 20… 72-212-X  https://www1…
 8 TX2007  2007 T1FF taxfiler d… CA06        StatCan 20… 72-212-X  https://www1…
 9 TX2008  2008 T1FF taxfiler d… CA06        StatCan 20… 72-212-X  https://www1…
10 TX2009  2009 T1FF taxfiler d… CA06        StatCan 20… 72-212-X  https://www1…
11 TX2010  2010 T1FF taxfiler d… CA06        StatCan 20… 72-212-X  https://www1…
12 TX2011  2011 T1FF taxfiler d… CA06        StatCan 20… 72-212-X  https://www1…
13 TX2012  2012 T1FF taxfiler d… CA11        StatCan 20… 72-212-X  https://www1…
14 TX2013  2013 T1FF taxfiler d… CA11        StatCan 20… 72-212-X  https://www1…
15 TX2014  2014 T1FF taxfiler d… CA11        StatCan 20… 72-212-X  https://www1…
16 TX2015  2015 T1FF taxfiler d… CA11        StatCan 20… 72-212-X  https://www1…
17 TX2016  2016 T1FF taxfiler d… CA16        StatCan 20… 72-212-X  https://www1…
18 TX2017  2017 T1FF taxfiler d… CA16        StatCan 20… 72-212-X  https://www1…
19 TX2018  2018 T1FF taxfiler d… CA16        StatCan 20… 72-212-X  https://www1…
vectors_census_2004 <- list_census_vectors("TX2004")

find_census_vectors("all families",dataset="TX2004",query_type="semantic") 
# A tibble: 8 × 4
  vector       type  label                                             details  
  <chr>        <fct> <chr>                                             <chr>    
1 v_TX2004_607 Total All families (CF + LP) - #                        Tax data…
2 v_TX2004_608 Total Median total income                               Tax data…
3 v_TX2004_609 Total Person Median Income                              Tax data…
4 v_TX2004_610 Total # persons                                         Tax data…
5 v_TX2004_622 Total All families (CF + LP) with employment income - # Tax data…
6 v_TX2004_623 Total Median employment income                          Tax data…
7 v_TX2004_632 Total All families (CF + LP) - #                        Tax data…
8 v_TX2004_633 Total Median Amount                                     Tax data…

Advanced Example (continued)

find_census_vectors("# of families in low income",dataset="TX2004",query_type="semantic") 
# A tibble: 21 × 4
   vector       type  label                               details               
   <chr>        <fct> <chr>                               <chr>                 
 1 v_TX2004_786 Total # of families in Low Income - Total Tax data 2004; Table …
 2 v_TX2004_782 Total With 0 children                     Tax data 2004; Table …
 3 v_TX2004_783 Total With 1 child                        Tax data 2004; Table …
 4 v_TX2004_784 Total With 2 children                     Tax data 2004; Table …
 5 v_TX2004_785 Total With 3+ children                    Tax data 2004; Table …
 6 v_TX2004_796 Total # of families in Low Income - Total Tax data 2004; Table …
 7 v_TX2004_792 Total With 0 children                     Tax data 2004; Table …
 8 v_TX2004_793 Total With 1 child                        Tax data 2004; Table …
 9 v_TX2004_794 Total With 2 children                     Tax data 2004; Table …
10 v_TX2004_795 Total With 3+ children                    Tax data 2004; Table …
# ℹ 11 more rows
years <- seq(2004,2018)

variables <- setNames(c(paste0("v_TX",years,"_607"),
                        paste0("v_TX",years,"_786")),
                      c(paste0("families_",years),
                        paste0("lico_",years)))

meta <- meta_for_ca_census_vectors(variables)

low_income <- get_tongfen_ca_census(regions = list(CMA="59933"), meta=meta, level="CT")

low_income <- low_income %>%
  mutate(`2004-2018`=lico_2018/families_2018-lico_2004/families_2004,
         `2004-2011`=lico_2011/families_2011-lico_2004/families_2004,
         `2011-2018`=lico_2018/families_2018-lico_2011/families_2011)
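To see what the setNames() call builds, it can be run on a shortened range of years. This sketch uses three years instead of the full range purely so the result fits on screen; the vector codes follow the same naming pattern as above.

```r
years <- 2004:2006  # shortened range, for display only

# Vector codes become the values, readable column names become the names
variables <- setNames(
  c(paste0("v_TX", years, "_607"), paste0("v_TX", years, "_786")),
  c(paste0("families_", years), paste0("lico_", years))
)
print(variables)

# Look up the vector code behind one of the readable names
lico_2005_code <- variables[["lico_2005"]]
print(lico_2005_code)
```

The names are what end up as column names in the returned data, which is why the mutate() step can refer to columns like lico_2018 and families_2018 directly.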

Mapping the Graph

low_income_pivoted <- low_income %>% pivot_longer(starts_with("20"))

low_income_pivoted %>% sf::st_sf() %>%
  ggplot(aes(fill=value)) + facet_wrap("name") +
  geom_sf(size=0.1) + scale_fill_gradient2(labels=scales::percent) +
  coord_sf(datum=NA,xlim=c(-123.25,-122.8),ylim=c(49.1,49.35)) +
  labs(title="Change in share of families in low income 2004-2018", fill=NULL,
       caption="T1FF F-20 family file")

Example on your Own

Now that we’ve gone through a lot of examples, we’ll have you all try one out on your own. In the meantime, I’ll walk you through the steps again (if you need it) and answer any questions you have. Here’s the goal:

  1. Find one Census year and vector that you want to analyze, either with CensusMapper in the browser or with the find_census_vectors() function.

  2. Pick a geography you want to map. You can again use CensusMapper to grab the GeoUID in the browser, or you can use the list_census_regions() function here in R.

  3. Map the data. Choose the level of geographic detail you want with the level argument, e.g. level="CT" or level="CSD".

  4. Good luck!

Extra Example of my Own:

pv <- c(households="v_CA21_4237")

census_data <- get_census("CA21", regions=list(CMA="59933"), vectors=pv, geo_format="sf", level="CT")

ggplot(census_data,aes(fill=households)) +
  geom_sf(size=NA) +  
  labs(title="# of Households in Vancouver",fill=NULL,caption="StatCan Census 2021")+ scale_fill_viridis_c()